Abstract. Weighted sampling without replacement has proved to be a
very important tool in designing new algorithms. Efraimidis and Spirakis
(IPL 2006) presented an algorithm for weighted sampling without
replacement from data streams. Their algorithm works under the assumption
of precise computations over the interval [0,1]. Cohen and Kaplan (VLDB
2008) used similar methods for their bottom-k sketches.

Efraimidis and Spirakis ask as an open question whether using finite
precision arithmetic impacts the accuracy of their algorithm. In this paper
we show a method to avoid this problem by providing a precise reduction
from k-sampling without replacement to k-sampling with replacement. We
call the resulting method Cascade Sampling.


1 Introduction 


Random sampling is a fundamental tool that has many applications in computer
science (see, e.g., Motwani and Raghavan, Knuth [3], Tillé, and Olken).
Random sampling methods are widely used in data stream processing because
of their simplicity and efficiency. In a stream, the size of
the domain and the probability of sampling an element both change constantly;
this makes the process of sampling non-trivial. We distinguish between sampling
with replacement, where all samples are independent (and thus can be repeated),
and sampling without replacement, where repetitions are prohibited.

In particular, weighted sampling without replacement has proven to be a very
important tool. In weighted sampling, each element is given a weight, and the
probability of an element being selected is based on its weight. Efraimidis
and Spirakis [5] presented an algorithm for weighted sampling without
replacement. Cohen and Kaplan [3] use similar methods for their bottom-k
sketches. While their preliminary implementation yielded promising results,
Efraimidis and Spirakis [5] state, as the main open problem of the paper,
“However, the question if, and to what extent, the finite precision arithmetic
affects the algorithms remains an open problem.”

In this paper we continue this work and provide a new algorithm to avoid
the issue of relying on finite precision arithmetic. With this result we show that
precision loss is not required in order to sample without replacement. We
accomplish this by providing a precise reduction from k-sampling without
replacement to k-sampling with replacement, using a special case of k-sampling
with replacement, unit sampling (where k = 1). Additionally, we believe that in
the future our method of expressing different random samples via reduction will
provide a tool that allows further translation of other sampling methods into a
more effective form for streams.

1.1 Related Work 

Due to its fundamental nature, the problem of random sampling has received 
considerable attention in the last few decades. 

In 1985, Vitter presented uniform sampling using a reservoir (with and
without replacement) over streams. Further, the question of reductions between
sampling methods has been addressed before. For instance, Chaudhuri, Motwani
and Narasayya [2] briefly discuss reductions for various sampling methods. Cohen
and Kaplan [3] use a “mimicking process” in their papers, which is essentially a
reduction from sampling without replacement to sampling with replacement.

Chaudhuri, Motwani and Narasayya [2] use the well-known method of
“over-sampling”, i.e., the set is sampled independently until k distinct
elements are obtained. Clearly, this schema does not introduce any precision
loss, since unit sampling is used as a black-box.

Unfortunately, the amount of resources that over-sampling requires is a
function of the weight distribution of the data set, and thus can be
arbitrarily large.

In particular, consider the case when there is an element with weight that is
overwhelmingly larger than the rest of the population. In this case, the number
of repetitions found while sampling with replacement is significantly larger than
k.
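
The cost of over-sampling can be seen in a minimal sketch. It assumes the whole population is held in memory and uses Python's `random.choices` as a stand-in for the black-box unit sampler; the function name `oversample` and the example weights are hypothetical and not taken from [2].

```python
import random

def oversample(items, weights, k, rng):
    """Draw weighted unit samples with replacement until k distinct items appear.

    Returns the distinct items in order of first appearance and the number
    of unit samples that were drawn.
    """
    if k > sum(1 for w in weights if w > 0):
        raise ValueError("fewer than k items have positive weight")
    seen = []
    draws = 0
    while len(seen) < k:
        x = rng.choices(items, weights=weights, k=1)[0]  # one unit sample
        draws += 1
        if x not in seen:
            seen.append(x)
    return seen, draws

# One element outweighs the rest, so most draws repeat it.
rng = random.Random(0)
sample, draws = oversample(["heavy", "a", "b", "c"], [10**4, 1, 1, 1], k=2, rng=rng)
print(sample, draws)
```

In this example each draw returns a new element with probability of roughly 3/10003 once the heavy element has been seen, so the number of draws is typically in the thousands even though k = 2.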

Probably the first effective non-streaming solution for the weighted sampling
without replacement problem was the algorithm of Wong and Easton. It
is used by many other algorithms (see Olken for a discussion). For data
streams, Efraimidis and Spirakis proposed an algorithm that is based on the
“exponent method”. The algorithm requires precise computations of random
keys $r^{1/w}$, where $r \sim U[0,1]$ and $w$ is the weight of the element.
The sample generated is composed of the k elements with maximal keys. Cohen
and Kaplan [3] used similar methods as a
building block for their bottom-k sketches. The bottom-k sketch is an effective
construction that has been extensively used for various applications including
approximations of aggregative queries over data streams. As Cohen and
Kaplan show, these methods are very effective in practical applications and are
superior to the sketches that are based on sampling with replacement.

While effective in practice, the algorithms of Efraimidis and Spirakis and
Cohen and Kaplan introduce a loss of accuracy, since their techniques require
additional floating point arithmetic operations.
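
For concreteness, the exponent method can be sketched as follows. This is an illustrative reading of the key-based scheme described above, not the authors' code; the function name `exponent_method_sample` and the heap-based bookkeeping are assumptions, and weights are assumed to be strictly positive.

```python
import heapq
import random

def exponent_method_sample(stream, k, rng):
    """Keep the k items with the largest keys r ** (1 / w), r ~ U[0, 1).

    `stream` yields (item, weight) pairs with weight > 0.
    """
    heap = []  # min-heap of (key, arrival index, item) holding the k largest keys
    for index, (item, weight) in enumerate(stream):
        r = rng.random()
        key = r ** (1.0 / weight)  # floating point step that the open problem concerns
        entry = (key, index, item)  # index breaks ties so items are never compared
        if len(heap) < k:
            heapq.heappush(heap, entry)
        elif key > heap[0][0]:
            heapq.heapreplace(heap, entry)
    # Largest key first.
    return [item for _, _, item in sorted(heap, reverse=True)]

picked = exponent_method_sample([("a", 1.0), ("b", 2.0), ("c", 3.0)], 2, random.Random(1))
print(picked)
```

The exponentiation `r ** (1.0 / weight)` is the only place where rounding enters, which is the step a comparison-only reduction avoids.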


1.2 Results 

In this paper we show that the tradeoff between precision and performance is not
a necessary property of sampling without replacement from data streams and
construct a precise streaming reduction from k-sampling without replacement
to k-sampling with replacement. This result provides a practical improvement
to the algorithms of Efraimidis and Spirakis in cases where high accuracy is
required.

Our method yields a surprisingly simple algorithm, given the importance
of sampling without replacement and the existence of many previous methods.
We call this algorithm Cascade Sampling. In particular, when used with the
algorithm from [2], Cascade Sampling requires O(k) memory, constant time per
element and the same precision as in [5].


1.3 Intuition 

Let A be any algorithm that maintains a unit weighted sample from stream
D. Similarly to the over-sampling method, we maintain k instances of A,
namely $A_1, \ldots, A_k$. However, we introduce the idea of stream
modification. That is, instead of applying A independently and symmetrically on
D, we apply $A_i$ on the modified stream $D_i$ that does not contain samples of $A_j$
for j < i. In particular, $A_i$ may process its input elements in an order different
from the order of their arrival in D. This simple but novel idea is sufficient to
solve the problem. In particular, we can claim that the input of $A_i$ is a random set
that precisely matches the definition of weighted sampling without replacement.
Since we use A as a black box with only a constant number of auxiliary variables,
specifically pointers, the resulting schema is a precise reduction.

2 Definitions 

An important building block of our algorithm is the concept of a unit sample, 
that is, the ability to sample a single element from a set. 

Definition 1. Let S be a finite set of elements and let w be a non-negative
function $w : S \to \mathbb{R}$. A random element $X_S$ with values from S is a unit
weighted random sample if, for any $a \in S$, $P(X_S = a) = \frac{w(a)}{w(S)}$.
Here $w(S) = \sum_{a \in S} w(a)$.

For an algorithm instantiating weighted unit sampling we provide Black-Box
WR2 from [2]. Black-Box WR2 produces a unit sample when the reservoir length
is r = 1.




Algorithm 1 Black-Box WR2: Algorithm for Weighted Unit Sampling

1. W ← 0.
2. Initialize reservoir with length r = 1, $A_0$.
3. For each tuple t in stream:
   (a) Get next tuple t with weight w(t)
   (b) W ← W + w(t)
   (c) Set $A_0 = t$ with prob. w(t)/W
4. Return $A_0$
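
A minimal Python sketch of a weighted unit sampler in the style of Algorithm 1 follows. It assumes non-negative integer weights so that the acceptance test can be done with integer arithmetic only; the class name `UnitSampler` and the method `feed` are hypothetical names introduced for illustration.

```python
import random

class UnitSampler:
    """Weighted unit reservoir sampler (reservoir length r = 1)."""

    def __init__(self, rng=None):
        self.rng = rng if rng is not None else random.Random()
        self.total = 0      # W: total weight seen so far
        self.sample = None  # A_0: the current sample

    def feed(self, item, weight):
        """Process one element; return True if it became the current sample."""
        if weight < 0:
            raise ValueError("weights must be non-negative")
        self.total += weight
        # randrange(total) is uniform on {0, ..., total - 1}, so the test
        # below succeeds with probability weight / total.
        if weight > 0 and self.rng.randrange(self.total) < weight:
            self.sample = item
            return True
        return False

sampler = UnitSampler(random.Random(0))
for item, weight in [("a", 3), ("b", 1), ("c", 6)]:
    sampler.feed(item, weight)
print(sampler.sample)
```

The first element with positive weight is always accepted, since its weight equals the running total at that point.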


Definition 2. A data stream is an ordered set of elements, $p_1, p_2, \ldots, p_n$, that
can be observed only once. An algorithm A is a streaming sampling algorithm if
A outputs a sample using a single pass over the data set.

Definition 3. A set $X = \{X_1, \ldots, X_k\}$ is called a k-sample with
replacement from S if $X_1, \ldots, X_k$ are independent random unit samples from S.

Another fundamental sampling method is weighted sampling without
replacement.

Definition 4. Let S be a finite set such that |S| > k. An ordered set
$X = \{X_1, \ldots, X_k\}$ is called a k-sample without replacement from S if
$X_1$ is a weighted unit sample from S and for any j > 1, $X_j$ is a weighted unit
sample from $S \setminus \{X_1, \ldots, X_{j-1}\}$.
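
As a small worked example of Definition 4 (the set and weights are hypothetical, chosen only for illustration), take S = {a, b, c} with w(a) = 1, w(b) = 2, w(c) = 3 and k = 2. The probability of an ordered sample is the product of the successive unit-sample probabilities:

```latex
\begin{align*}
P\big((X_1, X_2) = (c, a)\big)
  &= \frac{w(c)}{w(S)} \cdot \frac{w(a)}{w(S \setminus \{c\})}
   = \frac{3}{6} \cdot \frac{1}{3} = \frac{1}{6}, \\
P\big((X_1, X_2) = (a, c)\big)
  &= \frac{w(a)}{w(S)} \cdot \frac{w(c)}{w(S \setminus \{a\})}
   = \frac{1}{6} \cdot \frac{3}{5} = \frac{1}{10}.
\end{align*}
```

The two orderings of the same pair have different probabilities, which is why the definition is stated for an ordered set.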

Definition 5. We say that there exists a reduction from a k-sampling to
a unit sampling if for any unit sampling algorithm A there exists a k-sampling
algorithm T = T(A) that uses A as a black-box. We say that the reduction is
precise if for any A that requires memory m and time t:

1. T(A) requires O(km) memory and O(kt) time.

2. T(A) only uses comparisons (in addition to using A as a black box).

In other words, T(A) does not introduce any precision loss.

There exists a (trivial) precise reduction from weighted sampling with
replacement to unit sampling. In this paper we give the first precise streaming
reduction from weighted sampling without replacement to unit sampling.


3 Cascade Sampling 


Let S be a finite set such that |S| > k and let $a \notin S$. Denote $T = S \cup \{a\}$, and let
$w : T \to \mathbb{R}^{+}$ be a function. Let $\{X_1, \ldots, X_k\}$ be a k-sample without replacement
from S with respect to w. Define an ordered sequence $\{Y_1, \ldots, Y_k\}$ as follows:

$$Y_1 = \begin{cases} a, & \text{w.p. } \frac{w(a)}{w(T)}, \\ X_1, & \text{otherwise.} \end{cases} \tag{1}$$

Here the additional randomness is independent.
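For $i \geq 1$ define

$$L_i = \{X_1, \ldots, X_i, a\} \setminus \{Y_1, \ldots, Y_i\}. \tag{2}$$

We will show that $|L_i| = 1$; assuming that, let $Z_i$ be the single element from $L_i$,
i.e., $L_i = \{Z_i\}$. Put $U_i = T \setminus \{Y_1, \ldots, Y_i\}$. Define

$$Y_{i+1} = \begin{cases} Z_i, & \text{w.p. } \frac{w(Z_i)}{w(U_i)}, \\ X_{i+1}, & \text{otherwise.} \end{cases} \tag{3}$$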

Lemma 1. For all i = 1, ..., k the ordered set $\{Y_1, \ldots, Y_i\}$ is an i-sample
without replacement from T with respect to w.

Proof. We prove the lemma by induction on i. For i = 1 the statement follows
from direct computation and definitions. Assuming that the lemma is correct for
i, we need to prove that

$$Y_{i+1} \in T \setminus \{Y_1, \ldots, Y_i\}, \tag{4}$$

and for any $b \in U_i$:

$$P(Y_{i+1} = b) = \frac{w(b)}{w(U_i)}. \tag{5}$$

To show (4), observe that $\{Y_1, \ldots, Y_i\} \subseteq \{X_1, \ldots, X_i, a\}$ and $Y_{i+1} \in \{X_{i+1}, Z_i\}$.
By definition $X_{i+1} \notin \{X_1, \ldots, X_i, a\}$ and $Z_i \notin \{Y_1, \ldots, Y_i\}$.

To show (5), fix $\{X_1, \ldots, X_i\}$ and $\{Y_1, \ldots, Y_i\}$; it follows that $Z_i$ is fixed as
well. Denote $V_i = U_i \setminus \{Z_i\}$ and $H_i = S \setminus \{X_1, \ldots, X_i\}$; it follows that $H_i = V_i$.
For any fixed $b \in V_i$ we have

$$P(Y_{i+1} = b) = P(X_{i+1} = b) \cdot \frac{w(V_i)}{w(U_i)} = \frac{w(b)}{w(H_i)} \cdot \frac{w(V_i)}{w(U_i)} = \frac{w(b)}{w(U_i)}.$$

The case $b = Z_i$ is similar.
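
Lemma 1 can be checked exactly on a small instance. The sketch below uses rational arithmetic to enumerate the construction of $Y_1, \ldots, Y_k$ from a k-sample of S and the extra element a, and compares the resulting distribution with the distribution of a k-sample without replacement from T. The function names `wor_prob` and `cascade_distribution` and the example weights are hypothetical; the carried element starts as a and is updated as in (2).

```python
from fractions import Fraction
from itertools import permutations

def wor_prob(seq, w):
    """Probability of the ordered sample `seq` without replacement under weights w."""
    remaining = sum(w.values())
    p = Fraction(1)
    for x in seq:
        p *= Fraction(w[x], remaining)
        remaining -= w[x]
    return p

def cascade_distribution(w_S, a, w_a, k):
    """Exact distribution of (Y_1, ..., Y_k) built from X and a via (1) and (3)."""
    w_T = dict(w_S)
    w_T[a] = w_a
    dist = {}

    def extend(xs, ys, z, p, remaining):
        # z is the carried element (a at first, then Z_i); remaining = w(U_i).
        i = len(ys)
        if i == len(xs):
            dist[ys] = dist.get(ys, Fraction(0)) + p
            return
        take_z = Fraction(w_T[z], remaining)
        # Y_{i+1} = z with probability w(z) / w(U_i); X_{i+1} becomes the carry.
        extend(xs, ys + (z,), xs[i], p * take_z, remaining - w_T[z])
        # Otherwise Y_{i+1} = X_{i+1} and the carry is unchanged.
        extend(xs, ys + (xs[i],), z, p * (1 - take_z), remaining - w_T[xs[i]])

    for xs in permutations(w_S, k):
        extend(xs, (), a, wor_prob(xs, w_S), sum(w_T.values()))
    return dist, w_T

w_S = {"b": 2, "c": 3, "d": 5}
dist, w_T = cascade_distribution(w_S, "a", 1, k=2)
matches = all(dist.get(ys, 0) == wor_prob(ys, w_T) for ys in permutations(w_T, 2))
print(matches)
```

Because all probabilities are `Fraction` values, the comparison is exact rather than approximate.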


4 Precise Reduction and Resulting Algorithm 

Let A be an algorithm that maintains a unit weighted sample from D. The
algorithm from [2] is an example of A, but our reduction works with any algorithm
for unit weighted sampling. We construct an algorithm T = T(A) that
maintains a k-sample without replacement. Specifically, we maintain k instances
of A, $A_1, \ldots, A_k$, such that the input of $A_i$ is a random substream of D that
is selected in a special way. We denote the input stream for $A_i$ as $D_i$. Let $X_i$
be the sample produced by $A_i$. The critical observation is that our algorithm
maintains the following invariant: at any moment $D_i = D \setminus \{X_1, \ldots, X_{i-1}\}$.
Thus, by definition, the weighted sample from $D_i$ is the i-th weighted sample
from D when the samples are without replacement.

Here \ denotes the set difference, i.e. $A \setminus B = \{x : x \in A, x \notin B\}$.


Theorem 1. Algorithm T = T(A) maintains a weighted k-sample without
replacement from D. If A requires space O(g) and time per element O(h), then T
requires O(kg) space and O(kh) time, respectively. Thus, there exists a precise
reduction from k-sampling without replacement to a unit sampling.

Proof. Follows from the description of the algorithm (see Algorithm 2) and
Lemma 1.


Algorithm 2 Cascade Sampling
Input: Data Stream $D = \{p_1, \ldots, p_n\}$,
   A is an algorithm that maintains a unit weighted sample from D,
   $A_1, \ldots, A_k$ are independent instances of A
Output: Weighted k-Sample Without Replacement $\{Y_1, \ldots, Y_k\}$

1. For j = 1, 2, ..., n
   (a) new = $p_j$
   (b) For i = 1, ..., min{j, k}
       i. If (i < j) then set previous = $Y_i$ (where $Y_i$ is the current output of $A_i$).
       ii. Feed $A_i$ with new
       iii. If $Y_i$ changes its value to new, then set new = previous.
2. Output $\{Y_1, \ldots, Y_k\}$


Algorithm 2 provides a solution to the weighted k-sampling without
replacement problem. To better demonstrate the algorithm, we show an example of
updating a single unit sample inside of loop (b) in Figure 1. In this example,
unit sample $A_1$ has currently sampled element a and unit sample $A_2$ has currently
sampled element b, where a and b are elements that appeared previously in the
stream.
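
The cascade can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not a reference implementation: weights are assumed to be positive integers, the unit sampler follows the style of Algorithm 1, and the names `UnitSampler`, `feed` and `cascade_sample` are hypothetical.

```python
import random

class UnitSampler:
    """Weighted unit sampler in the style of Algorithm 1, positive integer weights."""

    def __init__(self, rng):
        self.rng = rng
        self.total = 0      # total weight fed to this instance
        self.sample = None  # current (item, weight) pair, None while empty

    def feed(self, pair):
        """Feed one (item, weight) pair; return True if it became the sample."""
        self.total += pair[1]
        # Exact acceptance test: true with probability weight / total.
        if self.rng.randrange(self.total) < pair[1]:
            self.sample = pair
            return True
        return False


def cascade_sample(stream, k, rng=None):
    """Return a weighted k-sample without replacement from (item, weight) pairs."""
    rng = rng if rng is not None else random.Random()
    samplers = [UnitSampler(rng) for _ in range(k)]
    for new in stream:
        if new[1] <= 0:
            raise ValueError("this sketch assumes positive integer weights")
        for sampler in samplers:
            previous = sampler.sample
            if sampler.feed(new):
                if previous is None:
                    break        # instance was empty: nothing is displaced
                new = previous   # displaced sample cascades to the next instance
    return [s.sample[0] for s in samplers if s.sample is not None]

stream = [("a", 5), ("b", 1), ("c", 3), ("d", 2)]
print(cascade_sample(stream, k=2, rng=random.Random(7)))
```

Outside the unit samplers, the loop only compares and reassigns references, in line with Definition 5. Breaking out when an empty instance accepts its first element plays the role of the bound min{j, k} in loop (b).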

4.1 Discussion 

There are several directions in which our algorithm can be improved. In
particular, the dependence of the running time on the number of samples is an
issue for practical datasets with large k. We believe this can be improved by
combining several sampling steps into a single step, which will be useful for the
cases when the element will not be sampled into any of the substreams. This will
often be the case for elements with small weights. Specifically, we ask if it is
possible to reduce the total running time from O(nk) to O(n log k).

Another interesting direction is applying this algorithm to weighted random
sampling with a bounded number of replacements as shown in [1]. Finally, this
method may also be interesting when applied to the Sliding Window Model
and Streams with Deletions.

We thank our anonymous reviewers for their helpful suggestions, particularly 
for suggesting interesting open problems for discussion. 
Fig. 1. Updating a Unit Sample. (Annotation in the figure: element a is given to $A_2$ to sample.)